Enabling Operator Reordering in Data Flow Programs 
Through Static Code Analysis 



Fabian Hueske Aljoscha Krettek Kostas Tzoumas 

Technische Universität Berlin, Germany Technische Universität Berlin, Germany Technische Universität Berlin, Germany 

fabian.hueske@tu-berlin.de aljoscha.krettek@campus.tu-berlin.de kostas.tzoumas@tu-berlin.de 






Abstract 

In many massively parallel data management platforms, programs 
are represented as small imperative pieces of code connected in a 
data flow. This popular abstraction makes it hard to apply algebraic 
reordering techniques employed by relational DBMSs and other 
systems that use an algebraic programming abstraction. We present 
a code analysis technique based on reverse data and control flow 
analysis that discovers a set of properties from user code, which 
can be used to emulate algebraic optimizations in this setting. 

1. Introduction 

Motivated by the recent "Big Data" trend, a new breed of massively
parallel data processing systems has emerged. Examples of these
systems include MapReduce and its open-source implementation
Hadoop, Dryad, Hyracks, and our own Stratosphere system [5]. These
systems typically expose a data flow programming model to the
programmer. Programs are composed as directed acyclic graphs (DAGs)
of operators, some of which are typically written in a
general-purpose imperative programming language. This model confines
control flow to the inside of operators, and permits only
dataflow-based communication between operators. Since operators can
only communicate with each other by passing sets of records in a
pre-defined hardwired manner, set-oriented execution and data
parallelism can be achieved.

Contrary to these systems, relational DBMSs, the traditional
workhorses for managing data at scale, are able to optimize queries
because they adopt an algebraic programming model based on
relational algebra. For example, a query optimizer is able to
transform the expression \(\sigma_{R.X<3}(R \bowtie (S \bowtie T))\)
to the expression \((\sigma_{R.X<3}(R) \bowtie S) \bowtie T\),
exploiting the associativity and commutativity properties of
selections and joins.

While algebraic reordering can lead to orders of magnitude faster
execution, it is not fully supported by modern parallel processing
systems, due to their non-algebraic programming models. Operators
are typically written in a general-purpose imperative language, and
their semantics are therefore hidden from the system. In our
previous work [10], we bridged this gap by showing that exposing a
handful of operator properties to the system can enable reorderings
that simulate most algebraic reorderings used by modern query
optimizers. We discovered these properties using a custom, shallow
code analysis pass over the operators' code. Here we describe this
code analysis in detail, which we believe is of interest by itself
as a non-traditional use case of code analysis techniques. We note
that our techniques are applicable in the context of many data
processing systems which support MapReduce-style UDFs, such as
parallel programming models, higher-level languages, and database
systems.

Related work: In our previous work [10] we describe and formally
prove the conditions to reorder user-defined operators. That paper
also contains a more complete treatment of related work. Here, we
focus on more directly related research. Manimal uses static code
analysis of MapReduce programs for the purpose of recommending
possible indexes. Our code analysis can be seen as an example of
peephole optimization, and some of the concepts may bear similarity
to techniques for loop optimization. However, we are not aware of
code analysis being used before for the purpose of swapping
imperative blocks of code to improve performance of data-intensive
programs.

The rest of this paper is organized as follows. Section 2 describes
the programming model of our system, and introduces the reordering
technology. Section 3 discusses our code analysis algorithm in
detail. Finally, Section 4 concludes and offers research directions.



2. Data Flow Operator Reordering 

In our PACT programming model [5], a program P is a DAG of sources,
sinks, and operators which are connected by data channels. A source
generates records and passes them to connected operators. A sink
receives records from operators and serializes them into an output
format. Records consist of fields of arbitrary types. To define an
operator O, the programmer must specify (i) a second-order function
(SOF) signature, picked from a pre-defined set of system
second-order functions (currently Map, Reduce, Match, Cross, and
CoGroup), and (ii) a first-order function (called user-defined
function, UDF) that is used as the parameter of the SOF. The model
is strictly second-order, in that a UDF is not allowed to call SOFs.
The intuition of this model is that the SOF defines a logical
mapping of the operator's input records into groups, and the UDF is
invoked once for each group. These UDF invocations are independent,
and can thus be scheduled on different nodes of a computing cluster.
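
The division of labor between SOF and UDF can be illustrated with a
minimal single-machine sketch. It is not the PACT implementation:
records are modelled as Python dicts from field position to value,
and the names map_sof, match_sof, add_fields, and merge are
hypothetical.

```python
from itertools import product

def map_sof(udf, records):
    """Map: every input record forms its own group; the UDF runs once per record."""
    out = []
    for rec in records:
        out.extend(udf(rec))
    return out

def match_sof(udf, left, right, left_key, right_key):
    """Match: one group per pair of records that agree on their key fields."""
    out = []
    for l, r in product(left, right):
        if l[left_key] == r[right_key]:
            out.extend(udf(l, r))
    return out

# Hypothetical UDFs; each returns the list of records it emits.
def add_fields(rec):
    return [{**rec, 2: rec[0] + rec[1]}]

def merge(l, r):
    return [{**l, **r}]

src1 = [{0: 1, 1: 10}, {0: 2, 1: 20}]
src2 = [{3: 1, 4: 5}]
mapped = map_sof(add_fields, src1)
joined = match_sof(merge, mapped, src2, 0, 3)
print(joined)
```

Because each UDF call only sees its own group, the loops inside
map_sof and match_sof could be distributed over several workers
without changing the result.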

Figure 1(a) shows an example PACT program. The data flow starts with
two data sources Src1 and Src2 that provide records which have the
fields [0,1] and [3,4] set respectively (the numbering is
arbitrary). Src1 feeds its data into a Map operator with a UDF f1.
The Map SOF creates an independent group for each input record, and
f1 is itself written in Java. UDF f1 reads both fields of its input
record (0 and 1), appends the sum of both fields as field 2, and
emits the record. Similarly, the records of Src2 are forwarded to a
Map operator with UDF f2 which sums the fields 3 and 4, appends the
sum as field 5 and emits the record. The outputs of both Map
operators are forwarded as inputs to a Match operator with a UDF f3
and the key field [0] for the first and [3] for the second input.
The Match SOF creates a group for each pair of records from both
inputs that match on their key fields. f3 merges the fields of both
input records and emits the result. We give the pseudo-code of all
three user functions in the form of 3-address code below.

Figure 1. Example Data Flows: (a) original order, (b) first
reordered alternative, (c) second reordered alternative







10  f1(InRec $ir)
11    $a:=getField($ir,0)
12    $b:=getField($ir,1)
13    $c:=$a + $b
14    $or:=copy($ir)
15    setField($or,2,$c)
16    emit($or)

20  f2(InRec $ir)
21    $x:=getField($ir,3)
22    $y:=getField($ir,4)
23    $z:=$x + $y
24    $or:=create()
25    setField($or,3,$x)
26    setField($or,4,$y)
27    setField($or,5,$z)
28    emit($or)

30  f3(InRec $ir1, InRec $ir2)
31    $or:=copy($ir1)
32    union($or,$ir2)
33    emit($or)

The pseudo-code shows the UDF API to process PACT records. The user
functions f1, f2, and f3 receive as input one or two input records
of type InRec. The only way that a user function can emit an output
record of type OutRec is by calling the emit(OutRec) function.
Output records can be either initialized as empty
(OutRec create()), or by copying an input record
(OutRec copy(InRec)). Records can be combined via the function
void union(OutRec, InRec). Fields can be read to a variable via
Object getField(InRec, int), addressed by their position in the
input record. The value of a field can be set via
void setField(OutRec, int, Object). Note that our record API is
based on basic operations and similar to other systems' APIs such as
Apache Pig.
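
To make the record API concrete, the following is a toy Python
rendering of it together with the three user functions. It is only a
sketch under several assumptions: one Record class stands in for
both InRec and OutRec, emit is passed to each UDF as a callback,
union lets the second record win on overlapping positions, and
setting a field to None removes it.

```python
class Record:
    """Toy stand-in for a PACT record: field position -> value."""
    def __init__(self, fields=None):
        self.fields = dict(fields or {})

def get_field(in_rec, pos):
    return in_rec.fields.get(pos)

def create():
    return Record()                      # empty output record

def copy(in_rec):
    return Record(in_rec.fields)         # output record with all input fields

def union(out_rec, in_rec):
    out_rec.fields.update(in_rec.fields)

def set_field(out_rec, pos, value):
    if value is None:                    # assumption: null projects the field away
        out_rec.fields.pop(pos, None)
    else:
        out_rec.fields[pos] = value

def f1(ir, emit):
    c = get_field(ir, 0) + get_field(ir, 1)
    out = copy(ir)                       # keeps fields 0 and 1 implicitly
    set_field(out, 2, c)
    emit(out)

def f2(ir, emit):
    x, y = get_field(ir, 3), get_field(ir, 4)
    out = create()                       # starts empty, so copies must be explicit
    set_field(out, 3, x)
    set_field(out, 4, y)
    set_field(out, 5, x + y)
    emit(out)

def f3(ir1, ir2, emit):
    out = copy(ir1)
    union(out, ir2)
    emit(out)

emitted = []
f1(Record({0: 1, 1: 10}), emitted.append)
f2(Record({3: 1, 4: 5}), emitted.append)
f3(emitted[0], emitted[1], emitted.append)
print([r.fields for r in emitted])
```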

Figures 1(b) and (c) show potential reorderings of the original data
flow (a) where either Map(f1) or Map(f2) has been reordered with
Match(f3, [0], [3]). While data flow (b) is a valid reordering,
alternative (c) does not produce the same result as (a). In previous
work, we presented conditions for valid reorderings of data flow
operators centered around conflicts of operators on fields [10]. For
example, since we know that f1 reads fields 0 and 1, and writes
field 2, while f3 reads fields 0 and 3, we can conclude that f1 and
f3 only have a read conflict on field 0, and can thus be safely
reordered. UDFs that have write conflicts cannot be reordered. This
would be the case if f1 did not append the sum as field 2, but
overwrote field 0 with the sum. Additional complications arise from
the way output records are formed. Although at first sight f1 and f2
perform a very similar operation, i.e., summing two fields and
appending the result, there is a fundamental difference. While f1
creates its output record by copying the input record (line 14), f2
creates an empty output record (line 24) and explicitly copies the
fields of the input record (lines 25, 26). The side effect of
creating an empty output record is that all fields of an input
record are implicitly removed from the output. By reordering Map(f2)
with Match(f3, [0], [3]), the fields 0, 1, and 2 will get lost since
Map(f2) does not explicitly copy them into the newly created output
record.
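
The difference between the two reorderings can be reproduced with a
small simulation. This is an illustrative sketch, not the system's
execution engine: records are dicts from field position to value,
the helper match is hypothetical, and the input values are made up.

```python
def f1(rec):
    # Copies the input record and appends field 2 = field 0 + field 1.
    return {**rec, 2: rec[0] + rec[1]}

def f2(rec):
    # Starts from an empty record and copies only fields 3 and 4.
    return {3: rec[3], 4: rec[4], 5: rec[3] + rec[4]}

def f3(left, right):
    return {**left, **right}

def match(left, right, left_key, right_key):
    return [f3(l, r) for l in left for r in right if l[left_key] == r[right_key]]

src1 = [{0: 1, 1: 10}, {0: 2, 1: 20}]
src2 = [{3: 1, 4: 5}, {3: 2, 4: 7}]

# (a) original order: both Maps run before the Match.
flow_a = match([f1(r) for r in src1], [f2(r) for r in src2], 0, 3)
# (b) Map(f1) moved after the Match.
flow_b = [f1(r) for r in match(src1, [f2(r) for r in src2], 0, 3)]
# (c) Map(f2) moved after the Match.
flow_c = [f2(r) for r in match([f1(r) for r in src1], src2, 0, 3)]

print(flow_a == flow_b)   # (b) yields the same records as (a)
print(flow_c)             # fields 0, 1, and 2 are missing in (c)
```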

The information that needs to be extracted from the user code in
order to reason about reordering of operators is as follows. The
read set \(R_f\) of a UDF f is the set of fields from its input data
sets that might influence the UDF's output, i.e., fields that are
read and evaluated by f. The write set \(W_f\) is the set of fields
of the output data set that have different values from the
corresponding input field. The emit cardinality bounds
\(\lfloor EC_f \rfloor\) and \(\lceil EC_f \rceil\) are lower and
upper bounds for the number of records emitted per invocation of f.
Reference [10] defines these properties more formally, and provides
conditions for reordering operators with various SOFs given
knowledge of these properties. In addition to changing the order of
operators, the optimizer can leverage these properties to avoid
expensive data processing operations, e.g., a previously partitioned
data set is still partitioned after a UDF was applied, if the
partitioning fields were not modified by the UDF. Moreover, field
projections can be pushed down based on read set information.
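
As a rough illustration of how read and write sets feed a conflict
test, consider the sketch below. It uses a deliberately simplified
rule (two operators may be swapped if neither writes a field the
other reads or writes) and ignores emit cardinalities; the exact
conditions are those of [10]. The Props class, the function names,
and the write set assumed for f2 when it is placed after the Match
are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Props:
    reads: frozenset
    writes: frozenset

def conflicts(a, b):
    """Fields written by one operator and read or written by the other."""
    return (a.writes & (b.reads | b.writes)) | (b.writes & (a.reads | a.writes))

def may_reorder(a, b):
    # Simplified test: read-read overlap is harmless, any write overlap is not.
    return not conflicts(a, b)

f1 = Props(reads=frozenset({0, 1}), writes=frozenset({2}))
f3 = Props(reads=frozenset({0, 3}), writes=frozenset())
# Assumption: placed after the Match, f2 sees fields 0..4 and drops 0, 1, 2,
# so those fields count as written in addition to the appended field 5.
f2_after_match = Props(reads=frozenset({3, 4}), writes=frozenset({0, 1, 2, 5}))

print(may_reorder(f1, f3))
print(sorted(conflicts(f2_after_match, f3)))
```

Under these assumptions f1 and f3 share only the read of field 0 and
may be swapped, while f2 conflicts with f3 on field 0.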

While it is very difficult to statically derive the exact properties
by UDF code analysis in the general case, it is possible to
conservatively approximate them. In reference [10] we discussed this
static code analysis pass for the simple case of unary operators. In
the next section, we provide the full algorithm that deals with the
additional complexity due to binary operators, and provide detailed
pseudo-code.

3. Code Analysis Algorithm 

Our algorithm relies on a static code analysis (SCA) framework to
get the bytecode of the analyzed UDF, for example as typed
three-address code. The framework must provide a control flow graph
(CFG) abstraction, in which each code statement is represented by
one node, along with a function PREDS(s) that returns the statements
in the CFG that are "true" predecessors of statement s, i.e., they
are not both predecessors and descendants. Finally, the framework
must provide two methods DEF-USE(s, $v) and USE-DEF(s, $v) that
represent the Definition-Use chain of the variable $v at statement
s, and the Use-Definition chain of variable $v at statement s,
respectively. Any SCA framework that provides these abstractions can
be used.

The algorithm visits each UDF in a topological order implied by the
program DAG starting from the data sources. For each UDF f, the
function VISIT-UDF of Algorithm 1 is invoked. First, we compute the
read set \(R_f\) of the UDF (lines 7-10). For each statement g of
the form $t := getField($ir, n) that results in a valid use of
variable $t (i.e., DEF-USE(g, $t) is not empty), we add field n to
\(R_f\).
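
The read set computation can be sketched as follows for
straight-line code. This is a simplification: statements are encoded
as hypothetical (target, op, args) tuples, and def_use only scans
forward in a statement list, whereas a real SCA framework would
follow the control flow graph.

```python
def def_use(stmts, idx, var):
    """Indices of later statements that use `var` before it is redefined.

    Straight-line code only; branches and loops are not modelled here.
    """
    uses = []
    for j in range(idx + 1, len(stmts)):
        target, _op, args = stmts[j]
        if var in args:
            uses.append(j)
        if target == var:
            break
    return uses

def read_set(stmts):
    """Fields fetched via getField whose value is actually used afterwards."""
    reads = set()
    for i, (target, op, args) in enumerate(stmts):
        if op == "getField" and def_use(stmts, i, target):
            reads.add(args[1])
    return reads

# f1 from the example program, plus one getField whose result is never used.
F1 = [
    ("$a", "getField", ("$ir", 0)),
    ("$b", "getField", ("$ir", 1)),
    ("$d", "getField", ("$ir", 7)),      # dead read, not part of the paper's f1
    ("$c", "add", ("$a", "$b")),
    ("$or", "copy", ("$ir",)),
    (None, "setField", ("$or", 2, "$c")),
    (None, "emit", ("$or",)),
]
print(read_set(F1))
```

The dead read of field 7 does not enter the read set because its
Definition-Use chain is empty.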

Approximating the write set \(W_f\) is more involved. We compute
four sets of integers that we eventually use to compute an
approximation of \(W_f\). The origin set \(O_f\) of UDF f is a set
of input ids. An integer \(o \in O_f\) means that all fields of the
o-th input record of f are copied verbatim to the output. The
explicit modification set \(E_f\) contains fields that are modified
and then included in the output. We generally assume that fields are
uniquely numbered within the program (as in Figure 1). The copy set
\(C_f\) contains fields that are copied verbatim from one input
record to the output. Finally, the projection set \(P_f\) contains
fields that are projected from the output, by explicitly being set
to null. The write set is computed from these sets using the
function COMPUTE-WRITE-SET (lines 1-5). All fields in \(E_f\) and
\(P_f\) are explicitly modified or set to null and are therefore in
\(W_f\). For inputs that are not in the origin set \(O_f\), we add
all fields of that input which are not in \(C_f\), i.e., not
explicitly copied.
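
A direct transcription of COMPUTE-WRITE-SET into Python may help to
check the definition against the example UDFs. The dict-based
encoding of input fields and the concrete set values below are
assumptions chosen for this sketch.

```python
def compute_write_set(input_fields, origin, explicit, copied, projected):
    """Approximate W_f from the origin, explicit, copy, and projection sets.

    input_fields maps each input id of the UDF to the set of field
    numbers present in that input.
    """
    write_set = set(explicit) | set(projected)
    for input_id, fields in input_fields.items():
        if input_id not in origin:
            # The input was not copied as a whole: every field that is not
            # copied explicitly is lost and therefore counts as written.
            write_set |= set(fields) - set(copied)
    return write_set

# f1 copies its whole input (origin {0}) and appends field 2.
w_f1 = compute_write_set({0: {0, 1}}, origin={0}, explicit={2},
                         copied=set(), projected=set())
# f2 starts from an empty record and copies fields 3 and 4 explicitly.
w_f2 = compute_write_set({0: {3, 4}}, origin=set(), explicit={5},
                         copied={3, 4}, projected=set())
# f2 applied to records that also carry fields 0, 1, and 2.
w_f2_wide = compute_write_set({0: {0, 1, 2, 3, 4}}, origin=set(), explicit={5},
                              copied={3, 4}, projected=set())
print(w_f1, w_f2, w_f2_wide)
```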

To derive the four sets, function VISIT-UDF finds all statements of
the form e: emit($or), which include the output record $or in the
output (line 11). It then calls for each statement e the recursive
function VISIT-STMT, which recurses from statement e backwards in
the control flow graph (lines 12-15). The function performs a
combination of reverse data flow and control flow analysis but does
not change the values computed for statements once they have been
determined. The function ANY returns an arbitrary element of a set.

Algorithm 1 Code analysis algorithm

 1: function COMPUTE-WRITE-SET(f, O_f, E_f, C_f, P_f)
 2:   W_f = E_f ∪ P_f
 3:   for i ∈ INPUTS(f) do
 4:     if i ∉ O_f then W_f = W_f ∪ (INPUT-FIELDS(f, i) \ C_f)
 5:   return W_f

 6: function VISIT-UDF(f)
 7:   R_f = ∅
 8:   G = all statements of the form g: $t=getField($ir,n)
 9:   for g in G do
10:     if DEF-USE(g, $t) ≠ ∅ then R_f = R_f ∪ {n}
11:   E = all statements of the form e: emit($or)
12:   (O_f, E_f, C_f, P_f) = VISIT-STMT(ANY(E), $or)
13:   for e in E do
14:     (O_e, E_e, C_e, P_e) = VISIT-STMT(e, $or)
15:     (O_f, E_f, C_f, P_f) = MERGE((O_f, E_f, C_f, P_f), (O_e, E_e, C_e, P_e))
16:   return (R_f, O_f, E_f, C_f, P_f)

17: function VISIT-STMT(s, $or)
18:   if VISITED(s, $or) then
19:     return MEMO-SETS(s, $or)
20:   VISITED(s, $or) = true
21:   if s of the form $or = create() then return (∅, ∅, ∅, ∅)
22:   if s of the form $or = copy($ir) then
23:     return (INPUT-ID($ir), ∅, ∅, ∅)
24:   P_s = PREDS(s)
25:   (O_s, E_s, C_s, P_s) = VISIT-STMT(ANY(P_s), $or)
26:   for p in P_s do
27:     (O_p, E_p, C_p, P_p) = VISIT-STMT(p, $or)
28:     (O_s, E_s, C_s, P_s) = MERGE((O_s, E_s, C_s, P_s), (O_p, E_p, C_p, P_p))
29:   if s of the form union($or, $ir) then
30:     return (O_s ∪ INPUT-ID($ir), E_s, C_s, P_s)
31:   if s of the form setField($or, n, $t) then
32:     T = USE-DEF(s, $t)
33:     if all t ∈ T of the form $t=getField($ir,n) then
34:       return (O_s, E_s, C_s ∪ {n}, P_s)
35:     else
36:       return (O_s, E_s ∪ {n}, C_s, P_s)
37:   if s of the form setField($or, n, null) then
38:     return (O_s, E_s, C_s, P_s ∪ {n})

39: function MERGE((O_1, E_1, C_1, P_1), (O_2, E_2, C_2, P_2))
40:   C = (C_1 ∩ C_2) ∪ {x | x ∈ C_1, INPUT-ID(x) ∈ O_2}
41:       ∪ {x | x ∈ C_2, INPUT-ID(x) ∈ O_1}
42:   return (O_1 ∩ O_2, E_1 ∪ E_2, C, P_1 ∪ P_2)

The useful work is done in lines 24-38 of the algorithm. First, the
algorithm finds all predecessor statements of the current statement,
and recursively calls VISIT-STMT. The sets are merged using the
MERGE function (lines 39-42). MERGE provides a conservative
approximation of these sets, by creating maximal E and P sets, and
minimal O and C sets. This guarantees that the data conflicts that
will arise are a superset of the true conflicts in the program. When
a statement of the form setField($or, n, null) is found (line 37),
field n of the output record is explicitly projected, and is thus
added to the projection set P. When a statement of the form
setField($or, n, $t) is found (line 31), the Use-Def chain of $t is
checked. If the temporary variable $t came directly from field n of
the input, n is added to the copy set C, otherwise it is added to
the explicit write set E. When we encounter a statement of the form
$or = create() (line 21), we have reached the creation point of the
output record, where it is initialized to the empty record. The
recursion then ends. Another base case is reaching a statement
$or = copy($ir) (line 22), where the output record is created by
copying all fields of the input record $ir. This adds the input id
of record $ir to the origin set O. A union statement (line 29)
results in an inclusion of the input id of the input record $ir in
the origin set O, and a further recursion for the output record $or.
The algorithm maintains a memo table MEMO-SETS to support early exit
of the recursion in the presence of loops (line 18). The memo table
is implicitly updated at every return statement of VISIT-STMT.
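
The backward traversal and the merge rule can be sketched in Python
as below. This is a simplified rendering rather than the original
implementation: it assumes a single output record variable,
statements encoded as hypothetical (op, args) tuples, precomputed
PREDS and USE-DEF tables without back edges, and it passes sets
through unchanged for statements that do not touch the output
record (a case the pseudo-code leaves implicit).

```python
EMPTY = frozenset()

def merge(a, b, field_input):
    """MERGE (lines 39-42): keep O and C minimal, E and P maximal."""
    (o1, e1, c1, p1), (o2, e2, c2, p2) = a, b
    c = ((c1 & c2)
         | {x for x in c1 if field_input[x] in o2}
         | {x for x in c2 if field_input[x] in o1})
    return (o1 & o2, e1 | e2, frozenset(c), p1 | p2)

def analyze(emit, stmts, preds, use_def, input_id, field_input):
    """Return (O, E, C, P) for the record emitted at statement `emit`."""
    memo = {}

    def visit(s):
        if s in memo:                                    # lines 18-19
            return memo[s]
        op, args = stmts[s]
        if op == "create":                               # line 21
            sets = (EMPTY, EMPTY, EMPTY, EMPTY)
        elif op == "copy":                               # lines 22-23
            sets = (frozenset({input_id[args[0]]}), EMPTY, EMPTY, EMPTY)
        else:
            results = [visit(p) for p in preds.get(s, [])]   # lines 24-28
            sets = results[0] if results else (EMPTY,) * 4
            for r in results[1:]:
                sets = merge(sets, r, field_input)
            o, e, c, p = sets
            if op == "union":                            # lines 29-30
                o = o | {input_id[args[0]]}
            elif op == "setField":
                n, value = args
                if value is None:                        # lines 37-38
                    p = p | {n}
                else:                                    # lines 31-36
                    defs = [stmts[d] for d in use_def[(s, value)]]
                    if defs and all(d[0] == "getField" and d[1][1] == n
                                    for d in defs):
                        c = c | {n}
                    else:
                        e = e | {n}
            sets = (o, e, c, p)
        memo[s] = sets
        return sets

    return visit(emit)

# Hypothetical UDF with two branches that meet at a single emit statement.
STMTS = {
    0: ("getField", ("$ir", 0)),   # $a := getField($ir, 0)
    1: ("copy", ("$ir",)),         # branch A: $or := copy($ir)
    2: ("create", ()),             # branch B: $or := create()
    3: ("setField", (0, "$a")),    # branch B: setField($or, 0, $a)
    4: ("emit", ()),               # emit($or), reached from both branches
}
PREDS = {3: [2], 4: [1, 3]}
USE_DEF = {(3, "$a"): [0]}
origin, explicit, copied, projected = analyze(
    4, STMTS, PREDS, USE_DEF, input_id={"$ir": 0}, field_input={0: 0, 1: 0})
print(origin, explicit, copied, projected)
```

In this example the origin set becomes empty because only one branch
copies the whole input, while field 0 stays in the copy set because
it is preserved on both paths.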

Function VISIT-STMT always terminates in the presence of loops in
the UDF code, since it will eventually find the statement that
creates the output record, or visit a previously seen statement.
This is due to PREDS always exiting a loop after visiting its first
statement. Thus, loop bodies are only visited once by the algorithm.
The complexity of the algorithm is \(O(en)\), where n is the size of
the UDF code, and e the number of emit statements. This assumes that
the Use-Def and Def-Use chains have been precomputed.

The lower and upper bound on the emit cardinality of the UDF can be
derived by another pass over the UDF code. We determine the bounds
for each emit statement e and combine those to derive the bounds of
the UDF. For the lower bound \(\lfloor EC_f \rfloor\), we check
whether there is a statement before statement e that jumps to a
statement after e. If there is none, the emit statement will always
be executed and we set \(\lfloor EC_f \rfloor = 1\). If such a
statement exists, statement e could potentially be skipped during
execution, so we set \(\lfloor EC_f \rfloor = 0\). For the upper
bound \(\lceil EC_f \rceil\), we determine whether there is a
statement after e that can jump to a statement before e. If yes, the
statement could be executed several times during the UDF's
execution, so we set \(\lceil EC_f \rceil = +\infty\). If such a
statement does not exist, statement e can be executed at most once
so we set \(\lceil EC_f \rceil = 1\). To combine the bounds we
choose for the lower bound of the UDF the highest lower bound over
all emit statements and for the upper bound the highest upper bound
over all emit statements.
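
The jump-based rules above can be written down in a few lines. The
encoding is an assumption made for this sketch: statements are
identified by their position, jumps are (source, target) pairs, a
backward jump that lands exactly on the emit statement is treated as
repeating it, and a UDF without any emit statement gets the bounds
(0, 0).

```python
import math

def emit_bounds(emits, jumps):
    """Return (lower, upper) emit cardinality bounds of a UDF.

    emits: positions of the emit statements.
    jumps: (source, target) positions of conditional or unconditional jumps.
    """
    if not emits:
        return (0, 0)
    lowers, uppers = [], []
    for e in emits:
        # A forward jump over e means e may be skipped.
        skippable = any(src < e < dst for src, dst in jumps)
        # A backward jump from after e to e or earlier means e may repeat.
        repeatable = any(src > e and dst <= e for src, dst in jumps)
        lowers.append(0 if skippable else 1)
        uppers.append(math.inf if repeatable else 1)
    # Combination rule from the text: highest lower and highest upper bound.
    return (max(lowers), max(uppers))

print(emit_bounds([16], []))          # straight-line UDF such as f1
print(emit_bounds([4], [(2, 5)]))     # filter: the emit can be jumped over
print(emit_bounds([3], [(5, 2)]))     # loop around the emit
```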

Our previous work [10] compares read and write sets derived
automatically by our static code analysis technique with those
obtained from manually attached annotations. We show that our
technique yields very precise estimations with only little loss of
optimization potential. However, we note that the estimation quality
depends on the programming style.

4. Conclusions and Future Work 

We presented a shallow code analysis technique that operates on data
flow programs composed of imperative building blocks ("operators").
The analysis is a hybrid of reverse data flow and control flow
analysis, and determines sets of record fields that express the data
conflicts of operators. These sets can be used to "emulate"
algebraic reorderings in the data flow program. Our techniques
guarantee safety through conservatism and are applicable to many
data processing systems that support UDFs. Future work includes
research on intrusive user-code optimizations, i.e., modifying the
code of UDFs, and on the effects that the use of functional
programming languages to specify UDFs has on our approach and
possible optimizations.